Language Independent Feature Extractor

Authors

  • Young-Seob Jeong
  • Ho-Jin Choi
Abstract

We propose a new customizable tool, Language Independent Feature Extractor (LIFE), which models the inherent patterns of any language and extracts relevant features of the language. This work makes two contributions: (1) no labeled data is necessary to train LIFE (it works when a sufficient number of unlabeled documents are given), and (2) LIFE is designed to be applicable to any language.

Although some studies aim to design language independent feature extractors, we argue that most of them are not truly language independent. First, many works depend on other resources or tools (e.g., WordNet) which are themselves inherently language-specific. Steinberger, Pouliquen, and Ignat (2006) employed many resources and features, so applying their approach to other languages would require a huge effort. Curran and Clark (2003) defined word-level features, alphabet-level features, and some features obtained from a gazetteer, which again is language-specific. Many of these approaches are usable only when such resources have been constructed for the target language. Second, most of these studies are applicable to alphabet-based languages (e.g., English), but not to non-alphabet-based languages (e.g., Korean, Chinese), because the characteristic difference between the two types of language is not considered. For example, Part-Of-Speech (POS) tags are usually assigned to each morpheme in Korean, while in English they are assigned to each word (token). There may be no blank space between morphemes or words in Korean and Chinese, while in English a blank space separates two words. To design a truly language independent feature extractor, the analysis of documents should not depend on such language-specific assumptions.

Several approaches (Chen et al. 2010; Jing et al. 2003) do not depend on such language-specific assumptions about what a word is. Instead, they employ letter-level features. Jing et al. (2003) compare letter-level, word-level, and class-level features by their performance on a Chinese NER task. The class-level features are defined to be the class tags of words, such as numbers, Chinese names, foreign names, etc. In this work, the letter-level...
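
The contrast between word-level and letter-level features described above can be illustrated with a minimal sketch. This is not the LIFE model itself; it is a generic character n-gram counter, and the function name `char_ngrams` and the sample strings are assumptions made only for illustration. The point it shows is that letter-level features need no whitespace tokenizer, so the same code runs unchanged on English and on Chinese text.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams in `text`.

    Every character, including a blank space, is treated the same way,
    so no assumption is made about how words are delimited.
    (Hypothetical helper for illustration, not part of LIFE.)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# English: words are separated by blank spaces.
english = char_ngrams("the cat", n=2)

# Chinese: no blank space between words, yet the same call works.
chinese = char_ngrams("我爱北京", n=2)

print(english.most_common(3))
print(chinese.most_common(3))
```

A word-level extractor that splits on blank spaces would return the whole Chinese string as a single token, which is the kind of language-specific assumption the abstract argues against.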

Similar resources

Cross-lingual portability of MLP-based tandem features - a case study for English and Hungarian

One promising approach for building ASR systems for less-resourced languages is cross-lingual adaptation. Tandem ASR is particularly well suited to such adaptation, as it includes two cascaded modelling steps: feature extraction using multi-layer perceptrons (MLPs), followed by modelling using a standard HMM. The language-specific tuning can be performed by adjusting the HMM only, leaving the ML...

A Chinese sign language recognition system based on SOFM/SRN/HMM

In sign language recognition (SLR), the major challenges now are developing methods that solve signer-independent continuous sign problems. In this paper, SOFM/HMM is first presented for modeling signer-independent isolated signs. The proposed method uses the self-organizing feature maps (SOFM) as different signers’ feature extractor for continuous hidden Markov models (HMM) so as to transform ...

Feature Extraction (Image Compression) of Printed Gujarati and Amharic Letters Using Discrete Wavelet Transform

This paper demonstrates an application of the discrete wavelet transform as a feature extractor, as applicable to natural language processing. The procedure discussed in this paper extracts important features of the printed characters of Gujarati and Amharic scripts and shows the image compression capabilities of discrete wavelets. The procedure prescribed in this paper compresses the original image to 75% ...

RExtractor: a Robust Information Extractor

The RExtractor system is an information extractor that processes input documents by natural language processing tools and consequently queries the parsed sentences to extract a knowledge base of entities and their relations. The extraction queries are designed manually using a tool that enables natural graphical representation of queries over dependency trees. A workflow of the system is design...

Fast Independent Component Analysis in Kernel Feature Spaces

It is common practice to apply linear or nonlinear feature extraction methods before classification. Usually linear methods are faster and simpler than nonlinear ones but an idea successfully employed in the nonlinearization of Support Vector Machines permits a simple and effective extension of several statistical methods to their nonlinear counterparts. In this paper we follow this general non...

Publication date: 2015